Install packages & load libraries necessary.

## Warning: package 'BiocGenerics' was built under R version 4.0.3
## Warning: package 'IRanges' was built under R version 4.0.3
## Warning: package 'S4Vectors' was built under R version 4.0.3
## Warning: package 'circlize' was built under R version 4.0.3
## Warning: package 'kableExtra' was built under R version 4.0.3
## Warning: package 'stringr' was built under R version 4.0.3

Recap of A1

The Expression Dataset I chose was GSE97356: Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study.

A bit about the dataset

gse <- getGEO("GSE97356",GSEMatrix=FALSE)

Dataset details:

Dataset platform details:

Name of Platform: Illumina HiSeq 2000 (Homo sapiens)

Submission Data Date: Nov 02 2010

Last Date Data was Updated: Mar 27 2019

Organisms Included in Platform: Homo sapiens

Quantity of GEO datasets that use this Platform: 8531

Quantity of GEO samples that use this Platform: 140818

Clean the Data and Map to HUGO Symbols

if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
con <- dbConnect(SQLite(),'GEOmetadb.sqlite')
sfiles <- getGEOSuppFiles('GSE97356')
fnames = rownames(sfiles) 
reads = read.delim(fnames[1],header=TRUE,sep = ",")
#rename first column to genes, as it's more meaningful than X
colnames(reads)[1] <- c("Gene")

The dimensions of my dataset are: 25830, 325

This means there are 25830 rows and 325 columns (the rows indicate genes and the columns are samples, ignoring the first column which delimits that the rows are genes).

I have three kinds of samples in my 324 subjects, based on the study design. The study design says “324 human samples; 201 never, 82 current, 41 past with PTSD”. I want to get a list of which subjects have never/current/past PTSD diagnosis to compare to my reads dataframe.

for(i in 1:324) {
    status <- gse@gsms[[i]]@header["characteristics_ch1"][[1]][1]
    case <- gse@gsms[[i]]@header["title"]
    if(i==1){
        metainfo <- data.frame("status"= status, "case" = case)
    }
    else{
        currentMetainfo <- data.frame("status"= status, "case" = case)
        metainfo <- rbind(metainfo, currentMetainfo)
    }
}
#for a sanity check ensure correct number of subjects with current/never/past PTSD diagnosis
occurences<-table(unlist(metainfo$status))
print(occurences)
## 
## ptsd: Current   ptsd: Never    ptsd: Past 
##            81           201            42

The metainformation found in the gse shows 81 subjects with current PTSD, 201 subjects who never had PTSD, and 42 who had past PTSD diagnosis. This doesn’t match the study design… but I looked at the paper and the published paper shows: “201 with never, 81 current, 42 past PTSD”, so we are actually correctly parsing the data, the study design is incorrectly labelled on GEO.

Filter weakly expressed features from my dataset

I’m going to filter lowly expressed genes using edgeR. Since I am keeping the genes that have at least 1 count per million within smallest group sample size, which is 42 (past PTSD), I’m looking for genes that have at least 1 count per million at least 42 times.

cpms = cpm(reads[,2:325])
#add in the genes associated with rownames
rownames(cpms) <- reads[,1]
qualityReads = rowSums(cpms >1) >= 42
filteredReads = reads[qualityReads,]

The filtered dimensions of the dataset now are: 14184, 325.

During filtering we removed 11646 genes.

Checking for duplicates after filtering

I’m going to see if the dataset has duplicates:

filteredReadsDf <- data.frame(table(filteredReads$Gene))
colnames(filteredReadsDf) <- c("Gene", "GeneOccurence")
duplicates <- filteredReadsDf %>% filter(GeneOccurence > 1)

It looks like I don’t have any duplicated gene names, the number of duplicated gene names is 0 .

Let’s take a look at how the gene names are stored:

head(filteredReadsDf$Gene, 5)
## [1] A1BG     A1BG-AS1 A2M      A2M-AS1  A2MP1   
## 14184 Levels: A1BG A1BG-AS1 A2M A2M-AS1 A2MP1 A3GALT2 AAAS AACS ... ZZZ3

Looking at gene AADACL2-AS1 for example, I see it is AADACL2 Antisense RNA 1. These appear to be gene ids, but I want to convert to HUGO symbols. I’m going to see if I can find the corresponding ensembl ids.

Mapping to Ensembl Ids

grch37 <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice") #from https://support.bioconductor.org/p/62064/
ensembl_grch37 = useDataset("hsapiens_gene_ensembl",mart=grch37)
idToEnsembl <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
               filters = c("wikigene_name"),
               values = filteredReads$Gene,
               mart = ensembl_grch37)

I can see that I have 0 gene ids that do not map to ensembl ids, so I’m not concerned. I’m now going to convert the ensembl ids to HUGO symbols.

Mapping to HUGO symbols

#first add the ensembl id labels to filteredReads
colnames(idToEnsembl) <- c("Gene", "EnsemblId")
filteredReads <- merge(idToEnsembl, filteredReads, by="Gene")
#now lets get the corresponding HUGO symbols
ensemblToHUGO <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
                   filters = c("ensembl_gene_id"),
                   values = filteredReads$EnsemblId,
                   mart = ensembl_grch37)

Now I have 413 ensembl ids that do not map to HUGO symbols. Lets see how many ensembl ids are duplicated.

Checking for duplicates in Ensembl Ids

filteredReadsDfEnsembl <- data.frame(table(filteredReads$EnsemblId))
colnames(filteredReadsDfEnsembl) <- c("EnsemblGene", "GeneOccurence")
duplicates <- filteredReadsDfEnsembl %>% filter(GeneOccurence > 1)
length(duplicates$EnsemblGene)
## [1] 549

I have 549 duplicated ensembl ids. I wonder how large the overlap between the gene names that are duplicated and those that do not have HUGO symbols is.

#first add the HUGO symbols to to filteredReads
colnames(ensemblToHUGO) <- c("EnsemblId", "HUGOSymbol")
filteredReads <- merge(ensemblToHUGO, filteredReads, by="EnsemblId")

emptyHUGO <- filteredReads %>% filter(HUGOSymbol == "")
duplicatedEnsembl <- data.frame(duplicates$EnsemblGene)
colnames(duplicatedEnsembl) <- c("EnsemblId")
overlapMissingDuplicated <- merge(emptyHUGO, duplicatedEnsembl, by="EnsemblId")
length(overlapMissingDuplicated$EnsemblId)
## [1] 117

It looks like there are 117 gene names that are duplicated ensembl ids and do not have HUGO symbols. The gene names that are missing HUGO symbols make up 3.0299283 % of the dataset. As some of these, 21.3114754 % exactly, have no corresponding HUGO symbol, but all have unique wikigene names, I’m not sure whether these genes are being mapped correctly and don’t feel comfortable removing them.

Additionally, according to lecture 4 slides 26 “if we base our analysis on ensembl gene ids then they are unique elements” as a comment not to necessarily filter out duplicates, and as I have removed lowly expressed genes I feel this is enough at this stage. Instead I’m going to try and preserve all these genes with duplicated names and just create versions of them so theyre unique. This is because the end goal of this assignment is to produce a dataframe with 324 numeric columns where each row has unique HUGO symbols as rownames. Lets see how many HUGO symbols are duplicated:

filteredReadsDfHUGO <- data.frame(table(filteredReads$HUGOSymbol))
colnames(filteredReadsDfHUGO) <- c("HUGOGene", "GeneOccurence")
HUGOduplicates <- filteredReadsDfHUGO %>% filter(GeneOccurence > 1)
length(HUGOduplicates$HUGOGene)
## [1] 1061

There are 512 more duplicated HUGO symbols than ensembl ids. This is a lot.

duplicatedHUGO <- data.frame(HUGOduplicates$HUGOGene)
colnames(duplicatedHUGO) <- c("HUGOSymbol")
duplicatedHUGOFull <- merge(duplicatedHUGO, filteredReads, by="HUGOSymbol")
duplicatedHUGOEnsembl <- merge(duplicatedHUGOFull, duplicatedEnsembl)
length(unique(duplicatedHUGOEnsembl$EnsemblId))
## [1] 549

This shows at least that all the duplicated ensembl ids also have duplicated HUGO symbols, so I’m really contending with the 512 duplicated HUGO symbol genes. I’m going to make the namings of these symbols unique so I can keep them all as discussed above. First I’m going to remove the ensembl ids that do not map to HUGO symbols, of which 117 are duplicated.

filteredReadsFinal <- filteredReads %>% filter(HUGOSymbol != "")

I lost 490 genes that didn’t have HUGO symbols. Now I’m going to make the namings of the HUGO symbols unique.

filteredReadsFinal <- filteredReadsFinal[, c(2,4:ncol(filteredReadsFinal))]
filteredReadsFinal$HUGOSymbol <- make.names(filteredReadsFinal$HUGOSymbol, unique=TRUE)
filteredReadsFinal <- tibble::column_to_rownames(filteredReadsFinal, var="HUGOSymbol")

Apply Normalization

Visualize the data

countDensity <- apply(log2(cpm(filteredReadsFinal)), 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(countDensity)) {
  xlim <- range(c(xlim, countDensity[[i]]$x));
  ylim <- range(c(ylim, countDensity[[i]]$y))
}
cols <- rainbow(length(countDensity))
ltys <- rep(1, length(countDensity))
p1 <- plot(countDensity[[1]], xlim=xlim, ylim=ylim, type="n",
     ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
for (i in 1:length(countDensity)) lines(countDensity[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(dataToPlot),
       col=cols, lty=ltys, cex=0.75,
       border ="blue",  text.col = "green4",
       merge = TRUE, bg = "gray90")


We can see some data is not following the same overall pattern as the other samples. Let’s normalize.

Normalizing

Let’s see if we can understand what happened visually.


The boxplot is difficult to interpret, but you do see an adjustment towards the median especially between cases 181 to 223. The count density graph does also show a change in the dataset distribution after normalization. The patients are more tightly clustered, for one.

MDS Plot

plotMDS(d, labels=(metainfo$title), col = c("green","blue", "red")[factor(metainfo$status)])


Here the colour-coding is defined as follows: Blue are patients that have never had PTSD, red are patients that have been diagnosed in the past with PTSD, and green are patients that currently have PTSD.

Since MDS plot represents the distances between samples, we would hope to see past,never and current patients cluster together, but we don’t quite see this.

Interpret the Data

  1. What are the control and test conditions of the dataset?
    The control and test conditions are actually stratified into three groups, patients with no PTSD diagnosis ever (never), patients that currently have PTSD diagnosis (current), and patients that previously have been diagnosed with PTSD (past).
  2. Why is the dataset of interest to you?
    I find the dataset interesting as my research project placement work for the Bioinformatics Specialist has been done at the CAMH Krembil Centre for Neuroinformatics. I have focused on analyzing cell type proportion changes in bulk-tissue RNAseq sampled from patients with Alzheimer’s disease. As a member of two different labs for BCB330 and BCB430 I have also been exposed to research on addiction & mental health & more broadly on how the brain changes in response to various trauma. Therefore I find it very interesting to continue to explore how gene expression patterns in the brain can be used to understand the neuropathologies and symptoms of mental illness/neurodegenerative disorders.
  3. Were there expression values that were not unique for specific genes? How did you handle these?
    The expression values I started with were unique, but once mapped to ensembl ids there were duplicates. I decided to keep them in the dataset upon further analysis as described above.
  4. Were there expression values that could not be mapped to current HUGO symbols?
    Yes. There were 0 ensembl ids that did not map to HUGO symbols.
  5. How many outliers were removed?
    During filtering we removed 9658 genes.
  6. How did you handle replicates?
    I chose not to remove them (see section: Checking for duplicates again).
  7. What is the final coverage of your dataset?
    I lost 10148 genes in filtering and mapping to HUGO Symbols, and 0 subjects or genes in normalization.

Differential Gene Expression

We want to to visualize our normalized data using a heatmap to get an idea of gene expression patterns.

I will take a closer look at the clustering of patients in my dataset.

Once again, it is clear to see that the patients are not clustering by the metrics evaluated in the study, i.e. there are not three separate clusters of samples corresponding to the PTSD presence/absence/past diagnosis categories. The majority of the patients are clustering around 0 in the leading logFC dim1 and dim2.

Another way to visualize the clustering is to colour by patient, but seeing as there aren’t two samples from one patient under differing conditions (i.e. before PTSD diagnosis and after, the following plot is not truly helpful in determining clustering patterns in the patients, definiteluy not as useful as if there were test and control conditions for each subject in the dataset.)

Limma Analysis

I’ll begin by creating a linear model using Limma (Ritchie et al., 2005), starting with a design matrix.

(Intercept) metainfo\(status Never </th> <th style="text-align:right;"> metainfo\)status Past
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 1
1 0 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 0 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 0 0
1 0 0
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 0 1
1 0 1
1 1 0
1 0 0
1 1 0
1 0 1
1 0 0
1 0 1
1 0 0
1 0 1
1 0 1
1 0 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 0
1 0 0
1 1 0
1 0 0
1 1 0
1 0 1
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 0 0
1 0 0
1 0 0
1 0 0
1 1 0
1 1 0
1 1 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 1
1 0 0
1 0 0
1 0 0
1 1 0
1 1 0
1 0 0
1 0 0
1 0 0
1 0 1
1 0 1
1 0 0
1 1 0
1 0 0
1 0 0
1 0 1
1 0 1
1 0 0
1 0 1
1 1 0
1 1 0
1 1 0
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 0 1
1 0 1
1 0 1
1 1 0
1 1 0
1 0 0
1 0 1
1 0 0
1 0 0
1 1 0
1 1 0
1 0 0
1 0 1
1 1 0
1 0 0
1 0 0
1 0 1
1 1 0
1 0 1
1 1 0
1 0 1
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 0
1 0 1
1 0 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 0 0
1 0 1
1 0 0
1 1 0
1 0 0
1 0 1
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 0 1
1 1 0
1 0 0
1 0 0
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 1
1 1 0
1 0 0
1 1 0
1 1 0
1 0 1
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 1
1 0 1
1 1 0
1 0 0
1 1 0
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 0
1 1 0
1 1 0
1 0 0
1 0 1
1 1 0
1 0 1
1 0 0
1 0 1
1 1 0
1 1 0
1 0 0
1 1 0
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 1 0
1 0 1
1 1 0
1 1 0
1 0 0

The next step is to create a data matrix and fit the data to the above model. This will be done using empirical Bayes to compute differential expression (note trend=TRU because this si RNAseq data). From this a table will be generated which contains p values and adjusted p values for gene expressions calculated.

x logFC AveExpr t P.Value adj.P.Val B
11814 SETD7 -0.7443755 6.340630 -3.745703 0.0002130 0.9900089 -3.794917
2117 CD44 -7.8225238 122.273084 -3.203193 0.0014948 0.9900089 -4.021657
5030 GPN3 0.1983226 1.170303 3.196877 0.0015269 0.9900089 -4.024114
15347 ZNF439 0.2208321 1.076629 3.188443 0.0015707 0.9900089 -4.027389
11677 SDHC -0.3881468 5.171646 -3.172328 0.0016577 0.9900089 -4.033625
9003 NT5C 0.1702865 1.112761 3.132173 0.0018942 0.9900089 -4.049041
2198 CDK12 -1.1712040 16.059256 -3.115183 0.0020034 0.9900089 -4.055511
8900 NPIPA2 -0.1202229 0.566841 -3.050312 0.0024756 0.9900089 -4.079927
8903 NPIPA3.1 -0.1202229 0.566841 -3.050312 0.0024756 0.9900089 -4.079927
8905 NPIPA5 -0.1202229 0.566841 -3.050312 0.0024756 0.9900089 -4.079927
## [1] 495
## [1] 0

Multiple Hypothesis Testing*

As there are multiple tests being done to check the significance of so many genes, we have to correct for multiple tests. This is due to the fact that the more tests are performed, the greater the likelihood of a false positive occurring by chance.

Lecture slides suggested that to minimize the false discovery rate the linear model should be updated to account for both the patients’ gene expression as well as the number of patients (note the previous model just focused on the patients’ gene expression).This could help us to control for patient variability.

However, I received the message : Error in .ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim, : No residual degrees of freedom in linear model fits when I tried to fit the model below using empirical Bayes as I don’t have replicated samples (Bioconductor Support, 2016). Controlling for patient variability through multiple testing does not apply to this dataset. Therefore the previous model using the Benjamni - Hochberg correction method is sufficient.

modelDesignPat <- model.matrix(~ metainfo$title + metainfo$status)
modelDesignPat[1:10,1:5]
##    (Intercept) metainfo$titleCase_10 metainfo$titleCase_100
## 1            1                     0                      0
## 2            1                     0                      0
## 3            1                     0                      0
## 4            1                     0                      0
## 5            1                     0                      0
## 6            1                     0                      0
## 7            1                     0                      0
## 8            1                     0                      0
## 9            1                     0                      0
## 10           1                     1                      0
##    metainfo$titleCase_101 metainfo$titleCase_102
## 1                       0                      0
## 2                       0                      0
## 3                       0                      0
## 4                       0                      0
## 5                       0                      0
## 6                       0                      0
## 7                       0                      0
## 8                       0                      0
## 9                       0                      0
## 10                      0                      0

EdgeR Analysis

I looked at how the “Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study” paper performed differential expression analysis to get a better idea of where to go from here. The differential expression analysis they did was done using “DESeq252 software based on negative binomial generalized linear models, adjusting for age, race and the five cell type proportions (CD8T, CD4T, natural killer, Bcell, monocytes) in discovery and replication cohorts, respectively”.

This gave me reason to believe that I should not be using the Limma package, as it works best with data that has an underlying linear distribution.I decided to try EdgeR (Robinson et al., 2010), a package for data that works best when the data follows a negative binomial distribution.

A negative binomial distribution has a specific shape when plotted, so I can easily verify that my data is not following a linear distribution by seeing if it adheres to the shape of the curve of a negative binomial distribution.

The data clearly follows a negative distribution (as plotted by the red X’s that follow the path of the blue line), meaning the EdgeR package will be well suited for differential analysis.The next step, as with Limma from before, is to to fit the data to the model.
logFC logCPM F PValue FDR
HBZ 2.5503142 3.056098 26.62521 4.0e-07 0.0046879
SIAE -0.8571986 2.713223 25.90904 6.0e-07 0.0046879
PCBP2 -0.2029574 5.129198 24.01898 1.5e-06 0.0077812
HBM 1.1931601 4.138682 22.18119 3.6e-06 0.0118740
GRB10 -0.5105582 3.005997 21.82027 4.3e-06 0.0118740
RAP2C.AS1 -0.4267586 5.391710 21.55919 4.9e-06 0.0118740
KANK1 -0.4887412 3.004285 21.41282 5.3e-06 0.0118740
L2HGDH -0.2504104 3.711189 20.98273 6.5e-06 0.0128306
CORIN -0.5377872 1.973798 20.50985 8.3e-06 0.0143900
UST -0.3229349 4.186808 20.19212 9.7e-06 0.0151451
x
BH
x
metainfo$status Never
x
glm

Using this new underlying distribution assumption, I want to see if I have any significantly differentially expressed genes now:

## [1] 2226
## [1] 32

Excitingly, I have 2226 genes pass the threshhold p value and 32 pass FDR (Benjamini Hochberg) correction.

Now that I have some differently expressed genes, let’s view these bad boys in a volcano plot!

Volcano Plot

Let’s take a look at upregulated genes.

There are not a lot of upregulated genes (clearly… there’s only two red dots), so let’s see what the volcano plot looks like with downregulated genes.

Seems to be a bit more interesting.

Top Hits

Now I want to make another heatmap and see if there is any clustering that is happening by PTSD: past, never, current condition. I will use ComplexHeatmap (Gu, 2016).

324 samples make the column names difficult to read for the heatmap, but they are ordered by “current”, “never” and “past” PTSD diagnosis as seen by Case_10, Case_14, Case_16, Case_19, Case_32. There doesn’t seem to be extremely obviously clustering by diagnosis, but in the bottom right corner of the heatmap there is a section that’s more blue and preceding it is a section that is more pale red, it’s also more pale red at the beginning.

  1. Calculate p-values for each of the genes in your expression set (see Limma Analysis and then see EdgeR Analysis). How many genes were significantly differentially expressed? What thresholds did you use and why?

Using limma 495 genes pass the threshhold p value and 0 genes pass correction. No values are therefore significant after correction.

Using EdgeR 2226 genes pass the threshhold p value and 32 genes pass FDR (BH) correction.

I chose the threshhold p value of 0.05 as it is accepted across scientific literature as significant. There is little benefit to being more stringent especially since I am doing multiple-hypothesis testing anyways.

  1. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method (see Multiple Hypothesis Testing). Which method did you use? And Why? How many genes passed correction?

See Multiple Hypothesis Testing. I used the Benjamini-Hochberg (BH) method, as it is less strict than Bonferroni but still accepted as robust. There were 32 genes that passed the FDR (BH) correction.

  1. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot (see Volcano Plot). Highlight genes of interest.

See Volcano Plot.

  1. Visualize your top hits using a heatmap (see Top Hits). Do your conditions cluster together? Explain why or why not.

See Top Hits.

Thresholded over-representation analysis

Let’s start by getting the upregulated and downregulated genes.

Getting Genes of Interest

I’ll be using EdgeR as that is the method that permitted me to find differentially expressed genes.

## [1] 71
## [1] 2155

There are 71 upregulated genes, and 2155 downregulated genes, which is pretty spicy. I want to get the EnsemblIDs to put them into the g:profiler webpage to find matches so the next step is to get the EnsemblIDs and save them into a text file.

Annotation Data

The annotation data sets I selected on g:profiler (Raudvere et al., 2019) GO biological process, Reactome, and WikiPathways, as they were the datasets we had used in a previous assignment and I felt comfortable interpreting them. The version I used was e102_eg49_p15_7a9b4d6.

Genesets Returned

I used the same significance thresholds as when I performed gene expression analysis (0.05) with multiple-hypothesis testing correction, i.e. with the FDR-corrected p values. This was done using the Benjamini-Hochberg method, and again, the threshhold for p values was 0.05, as this is the standard.

Up Regulation Results

Down Regulation Results

All Gene Results

Screenshots of gProfiler results

Upregulated genes

(No Wiki Pathways Results)

The genesets returned top terms related to hydrogen peroxide catabolic process and interferon signaling, generally related to immune system and metabolic processes (Gongora et al., 1999). This suggests the upregulated genes are involved in antiviral defense/immunity.

Downregulated genes

The genesets returned top terms related to protein modification enzymes, chromatin modification enzymes and the EGF/EGFR signal pathway which is involved in growth, differentiation, migration, adhesion and cell survival (WikiPathways, 2020). This indicates the downregulated genes are involved in cell growth and differentiation.

All Genes

The top term genesets for all genes were dominated by the downregulated gene results, again related to cell differentiation/survival.

  1. Which method did you choose and why?

I chose EdgeR as it had previously returned differentiated genes in the dataset, and the undelying assumptions of the data distributions matched this dataset best.

  1. What annotation data did you use and why? What version of the annotation are you using?

See Annotation Data.

  1. How many genesets were returned with what thresholds?

See Genesets Returned.

  1. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

See Genesets Returned. The results for all genes together were overall dominated by the the down-regulated genes, putting the genes together only returned 9 more genes in the intersection of T and Q. The terms returned were related to the same overarching themes as for the downregulated genes. However, there’s a clear distinction in term results between up-regulated and down-regulated genes (immune response/antiviral defense and cell differentiation/growth respectively).

  1. Present your results with the use of tables and screenshots. All figures should have appropriate figure legends.

See Genesets Returned.

Interpretation

  1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

The paper discussed they had found enrichment in “glucocorticoid receptor signaling and immunity-related pathways” but mentioned that these pathways were not robust to FDR correction. The paper also mentioned that gene FKBP5, with EnsembleID ENSG00000096060 was upregulated in patients with PTSD, which is found in the downregulated genes result of my analysis (downregulated in non-PTSD subjects as compared to those with PTSD). The glucocorticoid receptor pathway is a component of the immune response and we can see these immune response pathways come up in my analysis as well (upregulated gene pathways, which would be downregulated in PTSD patients).

  1. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

There is a lot of evidence supporting immune dysregulation in PTSD patients, which supports the results found in my upregulated gene pathway analysis. For example, the paper “Inflammatory markers in post-traumatic stress disorder: a systematic review, meta-analysis, and meta-regression” showed increased inflammation in subjects with PTSD compared to healthy controls (Passos et al., 2015). Another poignant example of this finding was the paper “Posttraumatic stress disorder and physical illness: results from clinical and epidemiologic studies” which compiled evidence to show that patients with PTSD may be at risk for autoimmune diseases (Boscarino, 2004).

References

  1. Bioconductor Support. (2016). ERROR: No residual degrees of freedom in linear model fits. Retrieved March 15, 2021, from https://support.bioconductor.org/p/59168/

  2. Boscarino JA (2004). Posttraumatic stress disorder and physical illness: results from clinical and epidemiologic studies. Ann N Y Acad Sci.1032:141-53. doi: 10.1196/annals.1314.011. PMID: 15677401.

  3. Gongora C, Mechti N. L’interféron: un mécanisme complexe de signalisation [Interferon signaling pathways] (1999). Bull Cancer. 86(11):911-9. French. PMID: 10586107.

  4. Gu, Z. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.

  5. Kuan, P., Waszczuk, M. A., Kotov, R., Clouston, S., Yang, X., Singh, P. K., . . . Luft, B. J. (2017). Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study. Translational Psychiatry, 7(12). doi:10.1038/s41398-017-0050-1

  6. Passos IC, Vasconcelos-Moreno MP, Costa LG, Kunz M, Brietzke E, Quevedo J, Salum G, Magalhães PV, Kapczinski F, Kauer-Sant’Anna M (2015). Inflammatory markers in post-traumatic stress disorder: a systematic review, meta-analysis, and meta-regression. Lancet Psychiatry, 2(11):1002-12. doi: 10.1016/S2

  7. Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.

  8. Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140

  9. Uku Raudvere, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, Jaak Vilo: g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Nucleic Acids Research 2019; doi:10.1093/nar/gkz369 [PDF].

  10. Wiki Pathways. (2020, June 30). EGF/EGFR Signaling Pathway (Homo sapiens). Retrieved from https://www.wikipathways.org/index.php/Pathway:WP437